Add a HybridReader for use in write constrained databases by jimhester · Pull Request #423 · posit-dev/ggsql

jimhester · 2026-05-01T20:54:17Z

Summary

Adds HybridReader, a Reader that composes any primary reader (the
"data" side) with an in-process DuckDBReader (the "staging" side).
register() writes go to staging; execute_sql routes queries that
mention any registered name to staging, and everything else to the
primary.

Behind the existing duckdb feature flag — no new feature, no new
dependencies.

Companion design comment with the broader sequencing context:
#341 (comment). Related to #422, but with a distinct implementation.

Motivation

Some data sources are read-only by nature (Flight SQL servers, anonymous
Trino) or expensive to write to repeatedly during visualization
iteration (Snowflake). HybridReader composes a primary reader (the
remote data source) with a local DuckDB instance (staging). register()
writes to staging — sidestepping read-only or auth restrictions — while
execute_sql routes queries to the right side based on which tables
they reference. Same Reader interface; no caller-visible difference.

The design also pairs naturally with a query-result cache (PR3) that
memoizes remote query results in the staging DuckDB. The cache isn't in
this PR, but the staging plumbing it relies on is.

Design

HybridReader owns:

- data: Box<dyn Reader + Send>, the primary backend.
- staging: DuckDBReader, an in-process DuckDB instance.
- staged_names: RefCell<HashSet<String>>, the names register() has put into staging.
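As a rough illustration of the ownership and routing described above, here is a minimal sketch. It is not the PR's code: the `Reader` trait is reduced to three string-based methods, `LabelReader` is a hypothetical stand-in for both the primary backend and DuckDBReader, and the routing predicate is a whole-word placeholder rather than the real scanner.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

/// Simplified stand-in for ggsql's `Reader` trait (assumption: the real
/// trait has more methods and returns data frames, not strings).
trait Reader {
    fn execute_sql(&self, sql: &str) -> Result<String, String>;
    fn register(&self, name: &str) -> Result<(), String>;
    fn unregister(&self, name: &str) -> Result<(), String>;
}

/// Hypothetical backend that only labels the queries it receives.
struct LabelReader {
    label: &'static str,
}

impl Reader for LabelReader {
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        Ok(format!("{}: {}", self.label, sql))
    }
    fn register(&self, _name: &str) -> Result<(), String> {
        Ok(())
    }
    fn unregister(&self, _name: &str) -> Result<(), String> {
        Ok(())
    }
}

struct HybridReader {
    data: Box<dyn Reader + Send>,
    staging: LabelReader, // DuckDBReader in the PR
    staged_names: RefCell<HashSet<String>>,
}

impl HybridReader {
    fn new(data: Box<dyn Reader + Send>) -> Self {
        HybridReader {
            data,
            staging: LabelReader { label: "staging" },
            staged_names: RefCell::new(HashSet::new()),
        }
    }

    /// Placeholder predicate: whole-word match, no quote handling.
    fn references_staged_name(&self, sql: &str) -> bool {
        let names = self.staged_names.borrow();
        sql.split(|c: char| !(c.is_alphanumeric() || c == '_'))
            .any(|word| names.contains(word))
    }
}

impl Reader for HybridReader {
    fn execute_sql(&self, sql: &str) -> Result<String, String> {
        // Queries are dispatched whole to exactly one side.
        if self.references_staged_name(sql) {
            self.staging.execute_sql(sql)
        } else {
            self.data.execute_sql(sql)
        }
    }
    fn register(&self, name: &str) -> Result<(), String> {
        // Write to staging first, then remember the name for routing.
        self.staging.register(name)?;
        self.staged_names.borrow_mut().insert(name.to_string());
        Ok(())
    }
    fn unregister(&self, name: &str) -> Result<(), String> {
        self.staging.unregister(name)?;
        self.staged_names.borrow_mut().remove(name);
        Ok(())
    }
}
```

In this model, a query mentioning a registered name goes to the staging side, and the same query goes back to the data side once the name is unregistered.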

The routing predicate references_staged_name is a lightweight SQL
scanner — not a full parser. It checks whether any registered name
appears as a SQL identifier (with identifier-boundary respect, qualified
references like catalog.schema.name, double-quoted identifiers, and
single-quoted-string-literal exclusion). Comments are not currently
parsed: a stray identifier inside a -- comment could route a
primary-data query to staging, where it would fail with a clear error
rather than succeeding against the primary backend.
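To make the scanner's rules concrete, here is one way such a predicate could look. This is a sketch written from the description above, not the function in the PR: the free-function signature, exact (case-sensitive) name comparison, and the helper names are assumptions. Like the described scanner, it does not parse comments.

```rust
use std::collections::HashSet;

fn is_ident_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_ident_part(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Returns true if any name in `names` appears in `sql` as a bare or
/// double-quoted identifier. Single-quoted string contents are ignored.
fn references_staged_name(sql: &str, names: &HashSet<String>) -> bool {
    if names.is_empty() {
        return false;
    }
    let chars: Vec<char> = sql.chars().collect();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\'' || c == '"' {
            // Collect the quoted run; a doubled quote is an escaped quote.
            let quote = c;
            let mut text = String::new();
            i += 1;
            while i < chars.len() {
                if chars[i] == quote {
                    if i + 1 < chars.len() && chars[i + 1] == quote {
                        text.push(quote);
                        i += 2;
                        continue;
                    }
                    break;
                }
                text.push(chars[i]);
                i += 1;
            }
            i += 1; // step past the closing quote
            // Only double-quoted runs are identifiers.
            if quote == '"' && names.contains(text.as_str()) {
                return true;
            }
        } else if is_ident_start(c) {
            // Consume the whole identifier so `orders` never matches
            // inside `orders_detail`. Qualified names are split by the
            // '.' separator, so each part is checked on its own.
            let start = i;
            while i < chars.len() && is_ident_part(chars[i]) {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            if names.contains(word.as_str()) {
                return true;
            }
        } else if c.is_ascii_digit() {
            // Skip numeric tokens whole (e.g. `1e5`).
            while i < chars.len() && is_ident_part(chars[i]) {
                i += 1;
            }
        } else {
            i += 1;
        }
    }
    false
}
```

Because comments are not parsed, a call such as `references_staged_name("SELECT 1 -- orders", &names)` returns true when `orders` is registered, which is the misrouting case noted above.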

Reader::dialect() returns the staging DuckDB dialect, because all
internally-generated SQL (stat transforms, layer filters, temp-table
DDL) targets staging. Callers that need the primary's dialect (e.g.
schema introspection of the remote catalog) get it via the inherent
HybridReader::data_dialect() method.
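The split between dialect() and data_dialect() can be sketched as follows. The `SqlDialect` trait, the struct names, and the exact CASE expression are illustrative assumptions; only the method name sql_greatest and the GREATEST(a, b) versus CASE-form distinction come from this PR's description.

```rust
/// Hypothetical dialect trait with a single method, for illustration.
trait SqlDialect {
    fn sql_greatest(&self, a: &str, b: &str) -> String;
}

struct AnsiDialect;
struct DuckDbDialect;

impl SqlDialect for AnsiDialect {
    // Assumed CASE form; the real Ansi output may differ in detail.
    fn sql_greatest(&self, a: &str, b: &str) -> String {
        format!("CASE WHEN {0} >= {1} THEN {0} ELSE {1} END", a, b)
    }
}

impl SqlDialect for DuckDbDialect {
    fn sql_greatest(&self, a: &str, b: &str) -> String {
        format!("GREATEST({}, {})", a, b)
    }
}

struct Hybrid {
    data_dialect: Box<dyn SqlDialect>,
    staging_dialect: DuckDbDialect,
}

impl Hybrid {
    /// Internally generated SQL targets staging, so this is the default.
    fn dialect(&self) -> &dyn SqlDialect {
        &self.staging_dialect
    }

    /// Explicit opt-in for callers that talk to the primary backend.
    fn data_dialect(&self) -> &dyn SqlDialect {
        self.data_dialect.as_ref()
    }
}
```

With different dialects on the two sides, generated SQL shows at a glance which dialect was used. A test with the same dialect on both sides could not tell them apart.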

Limitations (documented)

A single SQL statement cannot reference both staged names and
primary-data tables. Queries are dispatched whole; cross-backend joins
are unsupported. Materialize one side into staging first if you need to
combine them. There is a regression test pinning this behavior.
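A toy model of the whole-query dispatch shows why a mixed query fails rather than joining. Everything here is hypothetical (the `Backend` type, the `dispatch` function, and representing a query as the list of tables it references). The setup mirrors the idea of making a wrong route succeed silently, so that only correct routing produces the staging-side error.

```rust
use std::collections::HashSet;

/// Hypothetical backend: knows a set of tables and fails on any other.
struct Backend {
    label: &'static str,
    tables: HashSet<String>,
}

impl Backend {
    fn new(label: &'static str, tables: &[&str]) -> Self {
        Backend {
            label,
            tables: tables.iter().map(|t| t.to_string()).collect(),
        }
    }

    /// "Runs" a query, modelled as the list of tables it references.
    fn run(&self, referenced: &[&str]) -> Result<String, String> {
        for t in referenced {
            if !self.tables.contains(*t) {
                return Err(format!("{}: table '{}' not found", self.label, t));
            }
        }
        Ok(format!("{}: ok", self.label))
    }
}

/// Whole-query dispatch: any staged name sends the entire query to staging.
fn dispatch(
    staged: &HashSet<String>,
    referenced: &[&str],
    data: &Backend,
    staging: &Backend,
) -> Result<String, String> {
    if referenced.iter().any(|t| staged.contains(*t)) {
        staging.run(referenced)
    } else {
        data.run(referenced)
    }
}
```

If the data side holds both `staged_only` and `primary_only` while staging holds only `staged_only`, a query referencing both tables errors on the staging side under correct routing, whereas a wrong route to the data side would succeed.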

Staged data lives in the in-process DuckDB instance and is released
when the HybridReader is dropped — no spill-to-disk, no shared cache.

Testing

All tests are offline, no external setup:

- Routing scanner (9 tests): empty registered-name set, no match, single match, rejection of longer-identifier overlap (orders should not match orders_detail), rejection of identifier-prefix overlap (col should not match col_id), rejection of single-quoted-string contents, match of double-quoted identifiers, match of qualified references (catalog.schema.orders), and SQL-standard '' escape inside a string literal.
- Reader behavior (5 tests): register delegates to staging and tracks the name; execute_sql routes a registered name to staging; execute_sql routes an unregistered name to data; unregister delegates and untracks; dialect() returns the staging dialect with a discriminating SQLite-on-the-data-side setup.
- Cross-side limitation (1 test): a query referencing both staged and primary-only names routes wholly to staging and surfaces a staging-side error rather than silently joining. The setup discriminates correct routing from a wrong-route that would otherwise succeed.

The dialect-discrimination test uses a SqliteReader for the data
side (Ansi CASE-form sql_greatest) against a DuckDB staging
(GREATEST(a, b)), so a regression that returned the data dialect
instead of staging's would fail visibly. Gated on the sqlite feature,
which is in upstream's default feature set.

What's next

Per the design comment, a follow-up PR adds:

PR3: A query-result cache in the staging DuckDB
(hybrid_cache.rs), a Reader::clear_cache() trait default,
Vega-Lite v5+v6 mime emission in the Jupyter kernel, and the
-- @uncache Jupyter meta-command.

The cache makes the iterate-on-remote case sub-millisecond on cache
hits while keeping the same Reader interface; it's gated by an env
var and fronted by a public CacheConfig for callers that want to
tune TTL or the byte budget.

Wraps any Reader (the data side) with an in-process DuckDBReader (the staging side). register() writes to staging; execute_sql routes whole queries to staging or the primary based on whether they reference any registered name. Behind the existing 'duckdb' feature. Tests cover the routing scanner (identifier-boundary checks, qualified references, double-quoted identifiers, string-literal exclusion), register/unregister name tracking, dialect dispatch, and the documented cross-side limitation.

Per code review: the original tests for routing direction and dialect selection used identical setup on both sides, so they passed regardless of impl correctness. The dialect test now uses a SqliteReader on the data side (SQLite dialect) so the staging-vs-data distinction surfaces in sql_greatest output, and the cross-side test now registers staged_only in both data and staging with different values so a wrong-route would succeed silently rather than erroring for the same reason as the correct route. Also corrects an inverted "false-negative" label and softens the misleading "comments are harmless" note in the references_staged_name doc-comment.

Decode embedded parquet bytes via Arrow and register through the `arrow` virtual table function instead of writing a temp file and calling `read_parquet`. The latter triggers DuckDB's autoloadable parquet extension, which fails in offline or network-restricted environments (observed as a flaky CI failure on `test_ribbon_transposed_vegalite_encoding`). Mirrors the loader path SqliteReader already uses.

# Conflicts: # CHANGELOG.md

jimhester · 2026-05-21T16:28:43Z

Any updates on this? I'd like to open a PR adding caching of the viz side, but it requires this PR. I can stack it on top of this PR if you would like to see how it works; let me know.

georgestagg · 2026-05-22T08:52:51Z

Hi Jim, sorry about the delayed review. It's my fault, I've been busy with other projects.

I don't feel equipped to make the decision alone on this, because the introduced primary/staging mechanism is such a fundamentally different way to perform the computations required for visualisation. I am meeting with the team next week to discuss it further.

I am not saying "no". In fact I am sure we will have some form of this tiered approach go in. As you say, it's a requirement for local caching and will be a requirement especially when it comes to interactivity. However, I'm not yet convinced that the Reader is the right place for this mechanism to live long term, and there are some other questions I'd like to hash out with the team first.

Either way, I'll keep you updated, the PR has not been forgotten.

georgestagg · 2026-05-27T11:05:29Z

Hi Jim,

We've discussed things a bit more today, and I have a summary of our thoughts about the hybrid reader approach.

- We do indeed want a hybrid/staging system, to allow for interactivity and caching.
- Rather than this being a Reader that composes a reader with a duckdb process, we think that we should have a new concept that sits between that of the reader and the rest of ggsql. We're tentatively calling this concept "staging".
- The staging layer should be more general than any specific reader's SQL engine (i.e. duckdb). Though, we agree that duckdb is likely to be the most appropriate and commonly used engine for staging in most cases.
- The only requirement for a Reader to act as a staging layer should be that it is writable.
- We might employ heuristics in the future to allow for automatic/magic use of a staging layer. But, at least initially, we should expose the staging system explicitly, with:
  - Some --staging or similar command in the CLI.
  - Some support for indicating the staging layer in connection strings. Maybe something like staging://sqlite+duckdb://some-file.db, but we have not decided fully yet and details TBC.

We think it is likely this will be implemented in parallel to this PR, rather than based upon it. But, I won't be able to actually start the work proper for a couple of weeks, so I won't close the PR right now either.

> I'd like to open a PR adding caching of the viz side, but it requires this PR. I can stack it on top of this PR if you would like to see how it works; let me know.

I think stacked PRs in draft mode would be best for us; that would let us see how your caching mechanism works without having to wait for things to hit main. That would be good, in that we definitely would like to design the staging system to minimise disruption to your plans for caching.

Resolves conflicts:
- CHANGELOG.md: keep HybridReader Unreleased entry alongside upstream's new Unreleased additions (AdbcReader, aggregate, panel decorations, side setting).
- src/reader/mod.rs: keep both `pub use HybridReader` and the new `pub use AdbcReader` lines.
- src/reader/data.rs + duckdb.rs: drop PR2's `register_builtin_datasets_duckdb` in data.rs in favor of upstream's read_parquet-based implementation in duckdb.rs (private), which has spatial-extension loading for the new `world` dataset that PR2's arrow-vtab path lacks. The flaky parquet-extension CI fix on the Netflix fork can be re-applied as a follow-up if posit-dev sees it too.

Also picks up upstream's 0.3.3 release, the `world` builtin dataset, spatial layer, polar decorations, aggregate stat, ADBC reader (posit-dev#422), and many docs/tooling additions.

jimhester · 2026-05-29T15:40:03Z

Thanks for the writeup, the staging-layer framing makes sense, and we're happy to defer to whatever shape you land on for the long-term design.

I've opened the caching work as a draft stacked PR over in my fork so you can see the mechanism without it cluttering this repo's PR queue:

jimhester/ggsql#2 — pr3-hybrid-cache → pr2-hybrid-reader (in-fork base so the diff shows only the cache work, ~1.1 KLOC across src/reader/hybrid_cache.rs, integration in hybrid.rs, a Reader::clear_cache() default, and the -- @uncache + dual Vega-Lite v5/v6 mime additions in the Jupyter kernel).

A few notes for context:

- Default: enabled, 300s TTL, 512 MB byte-budget LRU. Toggled per-instance via HybridReader::with_cache_config(...) or globally via GGSQL_HYBRID_CACHE_DISABLED=1.
- Cache key is (per-HybridReader UUID, sql). Empty-width DataFrames (DDL-style results) bypass caching since DuckDB's arrow(...) table function rejects zero-column schemas.
- Tests include cache hit/miss, TTL=0, LRU eviction, clear_cache, and a set of cross-backend equivalence tests against ADBC SQLite (gated behind --features "adbc duckdb sqlite" -- --ignored cache_equivalence).
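For readers who want a feel for the behaviour in the notes above, here is an in-memory sketch of a TTL plus byte-budget LRU keyed on (reader id, sql). It is not the cache in jimhester/ggsql#2, which stores results in the staging DuckDB. The field names on `CacheConfig`, the `u64` reader id standing in for the UUID, the caller-supplied clock in seconds, and the rule that TTL=0 never produces a hit are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical config shape; defaults follow the numbers quoted above.
struct CacheConfig {
    enabled: bool,
    ttl_secs: u64,
    max_bytes: usize,
}

impl Default for CacheConfig {
    fn default() -> Self {
        CacheConfig { enabled: true, ttl_secs: 300, max_bytes: 512 * 1024 * 1024 }
    }
}

struct Entry {
    payload: Vec<u8>,
    inserted_at: u64,
    last_used: u64,
}

struct QueryCache {
    config: CacheConfig,
    entries: HashMap<(u64, String), Entry>,
    total_bytes: usize,
}

impl QueryCache {
    fn new(config: CacheConfig) -> Self {
        QueryCache { config, entries: HashMap::new(), total_bytes: 0 }
    }

    /// Looks up a result; `now` is a caller-supplied clock in seconds.
    fn get(&mut self, reader_id: u64, sql: &str, now: u64) -> Option<Vec<u8>> {
        if !self.config.enabled {
            return None;
        }
        let key = (reader_id, sql.to_string());
        let expired = match self.entries.get(&key) {
            Some(e) => now.saturating_sub(e.inserted_at) >= self.config.ttl_secs,
            None => return None,
        };
        if expired {
            // Drop stale entries on access.
            if let Some(e) = self.entries.remove(&key) {
                self.total_bytes -= e.payload.len();
            }
            return None;
        }
        let entry = self.entries.get_mut(&key)?;
        entry.last_used = now; // touch for LRU ordering
        Some(entry.payload.clone())
    }

    fn insert(&mut self, reader_id: u64, sql: &str, payload: Vec<u8>, now: u64) {
        if !self.config.enabled || payload.len() > self.config.max_bytes {
            return;
        }
        let key = (reader_id, sql.to_string());
        if let Some(old) = self.entries.remove(&key) {
            self.total_bytes -= old.payload.len();
        }
        self.total_bytes += payload.len();
        self.entries.insert(key, Entry { payload, inserted_at: now, last_used: now });
        // Evict least-recently-used entries until the budget is met.
        while self.total_bytes > self.config.max_bytes {
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    if let Some(e) = self.entries.remove(&k) {
                        self.total_bytes -= e.payload.len();
                    }
                }
                None => break,
            }
        }
    }
}
```

Including the reader id in the key means two readers issuing the same SQL never share a result, which matches the per-HybridReader scoping described above.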

Reading the description's "what's next" gives a quick sense of how this would map onto a more general staging layer. I'm happy to adapt the cache module to sit on top of whatever interface you land on, since the cache primitives (key derivation, meta-table DDL, lookup/insert/touch/drop/evict) are reader-agnostic.

jimhester added 5 commits May 1, 2026 16:01

feat(reader): export HybridReader behind 'duckdb' feature

27a4b8b

style(hybrid): apply cargo fmt + clippy fixes

4df07d2

docs(changelog): announce HybridReader

138ddee

thomasp85 requested a review from georgestagg May 4, 2026 08:51

jimhester added 2 commits May 4, 2026 11:43

Merge remote-tracking branch 'upstream/main' into pr2-hybrid-reader

7aa76ae

# Conflicts: # CHANGELOG.md

jimhester mentioned this pull request May 29, 2026

PR3: HybridReader query-result cache + clear_cache trait + @uncache + dual VL mime jimhester/ggsql#2

Draft

jimhester wants to merge 8 commits into posit-dev:main from jimhester:pr2-hybrid-reader
